
    Efficient GPU-accelerated fitting of observational health-scaled stratified and time-varying Cox models

    The Cox proportional hazards model stands as a widely used semi-parametric approach for survival analysis in medical research and many other fields. Numerous extensions of the Cox model have further expanded its versatility. Statistical computing challenges arise, however, when applying many of these extensions to the increasing complexity and volume of modern observational health datasets. To address these challenges, we demonstrate how to employ massive parallelization through graphics processing units (GPUs) to enhance the scalability of the stratified Cox model, the Cox model with time-varying covariates, and the Cox model with time-varying coefficients. First, we establish how the Cox model with time-varying coefficients can be transformed into the Cox model with time-varying covariates when using discrete time-to-event data. We then demonstrate how to recast both of these into a stratified Cox model and identify their shared computational bottleneck, which arises when evaluating the now segmented partial likelihood and its gradient with respect to regression coefficients at scale. These computations mirror a highly transformed segmented scan operation. While this bottleneck is not an immediately obvious target for multi-core parallelization, we convert it into an un-segmented operation to leverage the efficient many-core parallel scan algorithm. Our massively parallel implementation significantly accelerates model fitting on large-scale and high-dimensional Cox models with stratification or time-varying effects, delivering an order-of-magnitude speedup over traditional central processing unit-based implementations.
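    The central step above, evaluating per-stratum risk-set sums for the segmented partial likelihood, can be illustrated with ordinary cumulative sums. The following is a minimal NumPy sketch, not the authors' GPU implementation, showing how a segmented (per-stratum) suffix scan can be recovered from one un-segmented scan plus per-stratum offsets; function and variable names are illustrative, and subjects are assumed sorted by stratum and then by event time.

    import numpy as np

    def stratified_risk_set_sums(eta, strata):
        """For each subject i, return the sum of exp(eta_j) over subjects j in the same
        stratum with a later-or-equal event time (subjects pre-sorted by stratum, time)."""
        w = np.exp(eta)
        # One global (un-segmented) suffix scan over all subjects.
        suffix = np.cumsum(w[::-1])[::-1]
        # Remove the contribution of later strata from each subject's suffix sum.
        _, starts = np.unique(strata, return_index=True)
        ends = np.append(starts[1:], len(w))
        tail = np.zeros_like(w)
        for s, e in zip(starts, ends):
            tail[s:e] = suffix[e] if e < len(w) else 0.0
        return suffix - tail

    # Toy usage: two strata with linear predictors eta.
    eta = np.array([0.1, -0.2, 0.3, 0.0, 0.5])
    strata = np.array([0, 0, 0, 1, 1])
    print(stratified_risk_set_sums(eta, strata))

    On a GPU, the same subtraction-of-offsets idea lets a single scan primitive over the whole dataset stand in for many small per-stratum scans, which is the kind of restructuring the abstract describes.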

    Massive Parallelization of Massive Sample-size Survival Analysis

    Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, the increasing number of patients poses computational challenges when fitting survival regression models in such studies. In this paper, we use graphics processing units (GPUs) to parallelize the computational bottlenecks of massive sample-size survival analyses. Specifically, we develop and apply time- and memory-efficient single-pass parallel scan algorithms for Cox proportional hazards models and forward-backward parallel scan algorithms for Fine-Gray models, for analyses with and without a competing risk, using a cyclic coordinate descent optimization approach. We demonstrate that GPUs accelerate the fitting of these complex models in large databases by orders of magnitude compared to traditional multi-core CPU parallelism. Our implementation enables efficient large-scale observational studies involving millions of patients and thousands of patient characteristics.
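    As a rough CPU illustration of why these risk-set quantities reduce to scans, the sketch below computes the coordinate-wise Cox partial-likelihood gradient and curvature for a single covariate from backward cumulative sums (Breslow handling of ties, subjects sorted by increasing time). It is a simplified sketch with assumed names, not the single-pass GPU algorithm or the Fine-Gray forward-backward scan described in the paper.

    import numpy as np

    def cox_coordinate_gradient(x_j, eta, delta):
        """Gradient and curvature of the log partial likelihood with respect to one
        coefficient, using suffix scans for the risk-set sums (delta = 1 for events)."""
        w = np.exp(eta)
        s0 = np.cumsum(w[::-1])[::-1]                 # sum of exp(eta) over the risk set
        s1 = np.cumsum((w * x_j)[::-1])[::-1]         # sum of x_j * exp(eta)
        s2 = np.cumsum((w * x_j ** 2)[::-1])[::-1]    # sum of x_j^2 * exp(eta)
        mean = s1 / s0
        grad = np.sum(delta * (x_j - mean))           # first derivative
        curv = np.sum(delta * (s2 / s0 - mean ** 2))  # negative second derivative
        return grad, curv

    # Toy usage: one covariate column, linear predictors, and event indicators.
    x_j = np.array([1.0, 0.0, 2.0, 1.0])
    eta = np.array([0.2, -0.1, 0.0, 0.3])
    delta = np.array([1.0, 0.0, 1.0, 1.0])
    print(cox_coordinate_gradient(x_j, eta, delta))

    A cyclic coordinate descent iteration would then apply a one-dimensional Newton step, grad / curv, to the single coefficient before moving to the next covariate.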

    Rewriting and suppressing UMLS terms for improved biomedical term identification

    Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining, we implemented and evaluated nine term rewrite rules and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms, together with a sample of 100 randomly selected terms, were evaluated for every rule.
    Results: Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we identified 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, an increase of 2.3% in the number of terms and 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size; 7,397 terms were suppressed in the corpus.
    Conclusions: We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
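    To make the flavor of such rules concrete, the sketch below applies two hypothetical rewrite rules (syntactic inversion and removal of a trailing "NOS" qualifier) and one hypothetical suppression rule (very short terms). These are illustrative stand-ins for the kinds of rules discussed above, not the specific rule set evaluated in the paper.

    import re

    def rewrite_syntactic_inversion(term):
        """'Aneurysm, Ruptured' -> 'Ruptured Aneurysm' (single-comma inversions only)."""
        parts = term.split(", ")
        return f"{parts[1]} {parts[0]}" if len(parts) == 2 else term

    def rewrite_strip_nos(term):
        """Drop a trailing ' NOS' (not otherwise specified) qualifier."""
        return re.sub(r"\s+NOS$", "", term)

    def suppress_short_term(term, min_len=3):
        """Flag terms too short to be useful for corpus term identification."""
        return len(term) < min_len

    terms = ["Aneurysm, Ruptured", "Headache NOS", "Fe"]
    rewritten = [rewrite_strip_nos(rewrite_syntactic_inversion(t)) for t in terms]
    kept = [t for t in rewritten if not suppress_short_term(t)]
    print(kept)  # ['Ruptured Aneurysm', 'Headache']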

    Using the Data Quality Dashboard to improve the EHDEN network

    Federated networks of observational health databases have the potential to be a rich resource to inform clinical practice and regulatory decision making. However, the lack of standard data quality processes makes it difficult to know if these data are research ready. The EHDEN COVID-19 Rapid Collaboration Call presented the opportunity to assess how the newly developed open-source tool Data Quality Dashboard (DQD) informs the quality of data in a federated network. Fifteen Data Partners (DPs) from 10 different countries worked with the EHDEN taskforce to map their data to the OMOP CDM. Throughout the process, at least two DQD results were collected and compared for each DP. All DPs showed an improvement in their data quality between the first and last run of the DQD. The DQD excelled at helping DPs identify and fix conformance issues but showed less of an impact on completeness and plausibility checks. This is the first study to apply the DQD to multiple, disparate databases across a network. While study-specific checks should still be run, we recommend that all data holders converting their data to the OMOP CDM use the DQD, as it ensures conformance to the model specifications and that a database meets a baseline level of completeness and plausibility for use in research.
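    As a purely illustrative sketch of how per-category results from a DQD run might be summarized and compared between runs, the snippet below tallies pass rates for the conformance, completeness, and plausibility categories. The record layout and field names are assumed for illustration; this is not the DataQualityDashboard API.

    from collections import Counter

    def summarize_dqd_run(results):
        """results: iterable of dicts such as {'category': 'Conformance', 'failed': False}."""
        totals, failures = Counter(), Counter()
        for r in results:
            totals[r["category"]] += 1
            failures[r["category"]] += int(r["failed"])
        for cat in ("Conformance", "Completeness", "Plausibility"):
            n, f = totals[cat], failures[cat]
            pct = 100.0 * (n - f) / n if n else 100.0
            print(f"{cat}: {n - f}/{n} checks passed ({pct:.1f}%)")

    # Toy usage: an early run for one data partner; a later run would be summarized the same way.
    first_run = [
        {"category": "Conformance", "failed": True},
        {"category": "Completeness", "failed": False},
        {"category": "Plausibility", "failed": False},
    ]
    summarize_dqd_run(first_run)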